81 research outputs found

    GPU-Accelerated Large-Eddy Simulation of Turbulent Channel Flows

    High performance computing clusters augmented with cost- and power-efficient graphics processing units (GPUs) provide new opportunities to broaden the use of the large-eddy simulation technique to study high-Reynolds-number turbulent flows in fluids engineering applications. In this paper, we extend our earlier work on multi-GPU acceleration of an incompressible Navier-Stokes solver to include a large-eddy simulation (LES) capability. In particular, we implement the Lagrangian dynamic subgrid-scale model and compare our results against existing direct numerical simulation (DNS) data of a turbulent channel flow at Reτ = 180. Overall, our LES results match the DNS data fairly well. Our results show that the Reτ = 180 case can be simulated entirely on a single GPU, whereas higher-Reynolds-number cases can benefit from a GPU cluster.
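    The Lagrangian dynamic subgrid-scale model used in the paper carries per-point history and is beyond a short sketch; as a point of reference, the static Smagorinsky model it generalizes computes an eddy viscosity from the resolved strain rate. A minimal illustration on a uniform periodic grid (the function and variable names are hypothetical, not from the paper's solver):

```python
import numpy as np

def smagorinsky_nu_t(u, v, w, dx, Cs=0.1):
    """Static Smagorinsky eddy viscosity nu_t = (Cs * Delta)^2 * |S|,
    where |S| = sqrt(2 S_ij S_ij) is the resolved strain-rate magnitude.
    Uniform grid spacing dx; periodic central differences for simplicity."""
    def d(f, axis):  # central difference with periodic wrap-around
        return (np.roll(f, -1, axis) - np.roll(f, 1, axis)) / (2 * dx)
    # strain-rate tensor components S_ij = 0.5 * (du_i/dx_j + du_j/dx_i)
    S11, S22, S33 = d(u, 0), d(v, 1), d(w, 2)
    S12 = 0.5 * (d(u, 1) + d(v, 0))
    S13 = 0.5 * (d(u, 2) + d(w, 0))
    S23 = 0.5 * (d(v, 2) + d(w, 1))
    S_mag = np.sqrt(2 * (S11**2 + S22**2 + S33**2
                         + 2 * (S12**2 + S13**2 + S23**2)))
    return (Cs * dx) ** 2 * S_mag
```

    The dynamic procedure replaces the fixed constant Cs with one computed on the fly from a test filter, and the Lagrangian variant averages that computation along fluid pathlines.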

    Active memory controller

    Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
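    The message-traffic argument behind AMOs can be illustrated with a toy cost model: a shared counter updated by many processors forces the cache line to migrate on every conventional update, whereas an AMO sends the operation to the data instead. The message counts below are illustrative assumptions for a generic directory protocol, not figures from the paper:

```python
# Toy cost model contrasting an Active Memory Operation (AMO) with a
# conventional cache-coherent read-modify-write on a shared counter.
# The per-event message counts are illustrative assumptions.

def rmw_messages(n_updaters):
    """Conventional update: each updater fetches the line exclusively
    (read-exclusive request + data reply), and after the first update
    the directory must also invalidate the previous owner's copy."""
    msgs = 0
    for i in range(n_updaters):
        msgs += 2            # read-exclusive request + data reply
        if i > 0:
            msgs += 1        # invalidate the previous owner's copy
    return msgs

def amo_messages(n_updaters):
    """AMO-style update: each updater sends one operation to the home
    memory controller, which applies it in place; one ack returns."""
    return 2 * n_updaters    # operation request + ack per updater
```

    Under this model, 16 updaters cost 47 messages conventionally versus 32 with AMOs, and the line never ping-pongs between caches; that traffic reduction is what the paper's faster barriers and spinlocks exploit.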

    Parallel Solution of Recurrence Problems

    An mth-order recurrence problem is defined as the computation of the sequence x_1, ..., x_N, where x_i = f(a_i, x_{i-1}, ..., x_{i-m}) and a_i is some vector of parameters. This paper investigates general algorithms for solving such problems on highly parallel computers. We show that if the recurrence function f has associated with it two other functions that satisfy certain composition properties, then we can construct elegant and efficient parallel algorithms that can compute all N elements of the series in time proportional to ⌈log₂ N⌉. The class of problems having this property includes linear recurrences of all orders, both homogeneous and inhomogeneous, recurrences involving matrix or binary quantities, and various nonlinear problems involving operations such as computation with matrix inverses, exponentiation, and modulo division.
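    The composition property can be made concrete for the simplest case, a first-order linear recurrence x_i = a_i * x_{i-1} + b_i: each step is an affine map, affine maps compose associatively, so a parallel scan over the maps yields all N values in ⌈log₂ N⌉ rounds. A sketch (sequential code that mimics the parallel round structure):

```python
def solve_linear_recurrence(a, b, x0):
    """Solve x_i = a_i * x_{i-1} + b_i for i = 1..N in ceil(log2 N)
    composition rounds. Each affine map x -> a*x + b is a pair (a, b);
    the composition (a2,b2) after (a1,b1) is (a2*a1, a2*b1 + b2), and
    its associativity is the property the paper exploits."""
    n = len(a)
    maps = list(zip(a, b))           # maps[i] takes x_{i-1} to x_i
    k = 1
    while k < n:
        # on a parallel machine, all positions update at once per round
        new = maps[:]
        for i in range(k, n):
            a2, b2 = maps[i]         # later map
            a1, b1 = maps[i - k]     # earlier (prefix) map
            new[i] = (a2 * a1, a2 * b1 + b2)   # compose the prefixes
        maps = new
        k *= 2
    # after ceil(log2 n) rounds, maps[i] is the full map from x0 to x_i
    return [ai * x0 + bi for ai, bi in maps]
```

    For example, a = [2, 3, 1, 0.5], b = [1, 0, 2, 1], x0 = 1 gives [3, 9, 11, 6.5], matching a sequential evaluation, but using only 2 rounds instead of 4 dependent steps.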

    The energy complexity of register files

    Register files represent a substantial portion of the energy budget in modern processors, and are growing rapidly with the trend towards larger Instruction Level Parallelism (ILP). The energy cost of a register file access depends greatly on the register file circuitry used. This paper compares various register file circuitry techniques for their energy efficiency, as a function of architectural parameters such as the number of registers and the number of ports. The Port Priority Selection technique combined with differential reads and low-swing writes was found to be the most energy-efficient, and provided significant energy savings compared to traditional approaches in the case of large register files. The dependence of register file access energy upon technology scaling is also studied. However, as this paper shows, it appears that none of these techniques will be enough to prevent centralized register files from becoming the dominant power component of next-generation superscalar computers, and alternative methods for inter-instruction communication need to be developed.

    Should we worry about memory loss?

    In recent years the High Performance Computing (HPC) industry has benefited from the development of higher density multi-core processors. With recent chips capable of executing up to 32 tasks in parallel, this rate of growth also shows no sign of slowing. Alongside the development of denser micro-processors has been the considerably more modest rate of improvement in random access memory (RAM) capacities. The effect has been that the available memory-per-core has reduced, and current projections suggest that this is set to reduce further. In this paper we present three studies into the use and measurement of memory in parallel applications; our aim is to capture, understand and, if possible, reduce the memory-per-core needed by complete, multi-component applications. First, we present benchmarked memory usage and runtimes of six scientific benchmarks, which represent algorithms that are common to a host of production-grade codes. Memory usage of each benchmark is measured and reported for a variety of compiler toolkits, and we show greater than 30% variation in memory high-water-mark requirements between compilers. Second, we combine this benchmark data with runtime data to simulate, via the Maui scheduler simulator, the effect on a multi-science workflow if memory-per-core is reduced from 1.5 GB per core to only 256 MB. Finally, we present initial results from a new memory profiling tool currently in development at the University of Warwick. This tool is applied to a finite-element benchmark and is able to map high-water-mark memory allocations to individual program functions. This demonstrates a lightweight and accurate method of identifying potential memory problems, a technique we expect to become commonplace as memory capacities decrease.
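    The Warwick tool itself is not shown here, but the idea of attributing high-water-mark allocations to program locations can be approximated with Python's standard tracemalloc module; the workload below is a hypothetical stand-in for a real benchmark:

```python
import tracemalloc

def top_allocators(workload, limit=3):
    """Run `workload`, then report the traced peak and which source
    lines held the most live memory -- a rough analogue of mapping
    high-water-mark allocations to individual program functions."""
    tracemalloc.start()
    result = workload()                  # keep the result live for the snapshot
    snapshot = tracemalloc.take_snapshot()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del result
    stats = snapshot.statistics("lineno")[:limit]  # sorted by size, descending
    return peak, stats

def example_workload():
    # hypothetical stand-in for a finite-element assembly loop
    big = [0.0] * 1_000_000              # dominant allocation (~8 MB of pointers)
    return big
```

    Calling `top_allocators(example_workload)` points straight at the `big = ...` line as the dominant allocation site; a production tool would do the same attribution with far lower overhead and without needing an interpreter.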

    Optimization of high-performance superscalar architectures for energy efficiency

    In recent years reducing power has become a critical design goal for high-performance microprocessors. This work attempts to bring the power issue to the earliest phase of high-performance microprocessor development. We propose a methodology for power optimization at the micro-architectural level. First, major targets for power reduction are identified within the superscalar microarchitecture; then an optimization of a superscalar micro-architecture is performed that generates a set of energy-efficient configurations forming a convex hull in the power-performance space. The energy-efficient families are then compared to find configurations that dissipate the lowest power given a performance target, or, conversely, deliver the highest performance given a power budget. Application of the developed methodology to a superscalar micro-architecture shows that at the architectural level there is a potential for reducing power up to 50%, given a performance requirement, and for up to 15% performance improvement, given a power budget.
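    The selection step described above can be sketched as a Pareto-dominance filter over (performance, power) points, a discrete stand-in for the convex hull construction in the power-performance space; the configurations and their numbers below are hypothetical:

```python
def energy_efficient_frontier(configs):
    """Keep configurations not dominated in the power-performance space:
    a config is dropped if some other config delivers at least its
    performance for no more power, and is strictly better in one of the
    two. `configs` is a list of (name, performance, power) tuples."""
    frontier = []
    for name, perf, power in configs:
        dominated = any(
            p2 >= perf and w2 <= power and (p2 > perf or w2 < power)
            for _, p2, w2 in configs
        )
        if not dominated:
            frontier.append((name, perf, power))
    # sort by performance so "lowest power at a performance target" and
    # "highest performance within a power budget" are simple lookups
    return sorted(frontier, key=lambda c: c[1])

# hypothetical design points: (name, relative performance, watts)
designs = [("A", 1.0, 10), ("B", 1.2, 12), ("C", 1.1, 15), ("D", 0.9, 11)]
```

    Here "C" is dominated by "B" (faster for less power) and "D" by "A", so only A and B survive; answering either optimization question then means scanning a short sorted frontier instead of the full design space.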